High Performance Embedded Computing Software Initiative (HPEC-SI) 

Jeremy Kepner* (kepner@ll.mit.edu) MIT Lincoln Laboratory, Lexington, MA 02420 


Abstract 

The High Performance Embedded Computing Software Initiative (see
www.hpec-si.org) is addressing the military need to advance the state
of embedded software development tools, libraries, and methodologies
to retain the nation's military technology advantage in increasingly
software-based systems. Key accomplishments include completion of the
first demonstration and the development of the Parallel VSIPL++
standard. Currently the HPEC-SI effort is on track towards its goal of
changing the state-of-the-practice in programming DoD HPEC signal and
image processing (SIP) systems.


1 Introduction 

The High Performance Embedded Computing Software Initiative (HPEC-SI)
involves a partnership of industry, academia, and government
organizations to foster software technology insertion demonstrations,
to advance the development of existing standards, and to promote a
unified computation/communication embedded software standard. The goal
of the initiative is software portability: to enable
"write-once/run-anywhere/run-anysize" for applications of high
performance embedded computing (see [7, 4, 10, 8, 9, 18, 12]).

This paper gives a brief overview of the HPEC-SI program objectives,
technical objectives and program plans. Detailed progress of the
demonstration, development and applied research activities that are
taking place within HPEC-SI can be found in the HPEC2002 [15, 20, 27],
GOMAC2002 [26, 5, 11, 21, 23], GOMAC2003 [28, 6, 14, 17, 22], and
other conferences [16, 13].


*This work is sponsored by the High Performance Computing
Modernization Office, under Air Force Contract F19628-00-C-0002.
Opinions, interpretations, conclusions and recommendations are those
of the author and are not necessarily endorsed by the United States
Government.


2 Program Objectives 

HPEC-SI is organized around demonstrations, standards development and
applied research. Each of these activities is overseen by a Working
Group. The demonstrations team Prime contractors with FFRDC or
academic partners to use currently defined standards, evaluate their
performance, and report on how well their needs are being met. The
first demonstration was with the Common Imagery Processor (CIP) and
successfully showed the use of the MPI communication standard ([1])
and the VSIPL computation standard ([2]) to achieve portability (while
preserving performance) across shared memory servers and distributed
memory embedded systems. The Development Working Group is extending
the VSIPL standard to include parallel object-oriented software
practices already prototyped by the research community. This effort is
tightly coupled with military demonstrations, and provides the next
generation of standards with direct feedback from the military user
base. The Applied Research Working Group is taking a longer term view
to assess the potential impact of a variety of emerging technologies
such as fault tolerance and dynamic scheduling, self-optimization, and
next generation high productivity languages.

3 Technical Objectives 

The HPEC-SI program uses three principal metrics to 
measure the progress of its efforts: 

• Portability (reduction in the lines of code that must change to
port or scale to a new system);

• Productivity (reduction in overall lines-of-code); 

• Performance (computation and communication 
benchmarks). 

Traditionally, it has always been possible to improve in two of the
above areas while sacrificing the third. HPEC-SI aims to improve
quantitatively in all three areas.

HPEC-SI expects to achieve at least a 3x reduction in the number of
code changes necessary to port an application across computing
platforms. This improvement will primarily be achieved through the use
and enhancement of open software standards (MPI and VSIPL) that will
insulate applications from the details of the underlying hardware. An
equivalent reduction in code changes will also be seen when porting
from one size of platform to another. This will be achieved by the
development of a unified computation and communication standard
(Parallel VSIPL) which will allow applications to be moved from a
computer with N processors to a computer with M processors with
minimal code changes.

HPEC-SI expects to achieve a 3x reduction in the total number of lines
of code necessary to implement an application. This productivity
improvement will come primarily through the use of higher level
object-oriented languages (e.g. C++) as well as a unified computation
and communication library which will abstract away many of the
code-intensive details of writing a parallel program.

HPEC-SI expects to achieve a 1.5x increase in performance over
existing approaches on some computation and communication benchmarks.
This is primarily due to an increased level of abstraction which
allows the increased use of "early binding" in the application, in the
library and in the compiler. [Early binding is the process of building
data structures in advance that increase performance at runtime.]
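
The idea of early binding can be illustrated with a small C++ sketch. It is not taken from VSIPL or any HPEC-SI code: the class name `HannWindow` and its interface are hypothetical, chosen only to show setup-time work being separated from the runtime path. The window coefficients (the "data structure built in advance") are computed once in the constructor, so each per-frame call reduces to a multiply loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical example of early binding: all trigonometry is done once,
// at construction, and the result is reused for every frame.
class HannWindow {
public:
    explicit HannWindow(std::size_t n) : coeffs_(n) {
        const double pi = std::acos(-1.0);
        for (std::size_t i = 0; i < n; ++i) {
            // Setup-time cost: one cosine per coefficient.
            coeffs_[i] = (n > 1)
                ? 0.5 * (1.0 - std::cos(2.0 * pi * i / (n - 1)))
                : 1.0;
        }
    }

    // Runtime path: no trigonometry and no allocation, only multiplies.
    void apply(std::vector<double>& frame) const {
        assert(frame.size() == coeffs_.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {
            frame[i] *= coeffs_[i];
        }
    }

private:
    std::vector<double> coeffs_;
};
```

A caller would construct one `HannWindow` per frame length during initialization and call `apply` on each incoming frame; the trade is a little memory and setup time for a cheaper inner loop.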

4 Summary 

The current achievements of HPEC-SI include the successful utilization
of the Vector Signal and Image Processing Library (VSIPL) and the
Message Passing Interface (MPI) to demonstrate a tactical synthetic
aperture radar (SAR) code running without modifications and at high
performance on parallel embedded, server and cluster systems. HPEC-SI
is also creating the first parallel object-oriented computation
standard by adding these extensions to the VSIPL standard. The
Parallel VSIPL++ standard will allow high performance parallel signal
and image processing applications to take advantage of the increased
productivity offered by object-oriented programming as well as the
performance advantages found using advanced expression template
technology. The draft object-oriented specification and reference code
are both available on the HPEC-SI website and are being tested by a
variety of early adopters. Finally, HPEC-SI is evaluating advanced
software technologies such as fault tolerance and the use of higher
level languages to determine which aspects are ready for future
standardization. Combined, all of these efforts are successfully
changing the state-of-the-practice in programming DoD HPEC SIP
systems. Critical to this effort has been the availability of a wide
variety of HPCMO systems (Mercury, Sky, SGI, Compaq, IBM, Linux, and
FPGA) that has allowed the testing and demonstration of advanced
software technologies for DoD signal and image processing
applications.

Overview - High Performance
Embedded Computing (HPEC) Initiative

Challenge: Transition advanced software technology and practices
into major defense acquisition programs

[Figure labels: Common Imagery Processor (CIP); Shared memory server; "processor" (label truncated); (ETRAC)]

Why Is DoD Concerned with
Embedded Software?

[Figure: Estimated DoD expenditures for embedded signal and image
processing hardware and software ($B)]

* COTS acquisition practices have shifted the burden from "point design"
hardware to "point design" software

* Software costs for embedded systems could be reduced by one-third
with improved programming models, methodologies, and standards

Issues with Current HPEC Development

Inadequacy of Software Practices & Standards

[Figure: example platforms: P-3/APS-137, MK-48 Torpedo, JSTARS, Rivet Joint, F-16, NSSN]

[Figure: System Development/Acquisition Stages, with three 4-year
intervals marked. Program milestones: System Tech. Development; System
Field Demonstration; Engineering/Manufacturing Development; Insertion
to Military Asset. Signal processor evolution over the same span:
1st gen., 2nd gen., 3rd gen., 4th gen., 5th gen.]

* High Performance Embedded Computing pervasive through DoD
applications

  - Airborne Radar Insertion program: 85% software rewrite for each
    hardware platform

  - Missile common processor: processor board costs < $100k; software
    development costs > $100M

  - Torpedo upgrade: two software re-writes required after changes in
    hardware design

Today - Embedded Software Is:

* Not portable

* Not scalable

* Difficult to develop

* Expensive to maintain

Evolution of Software Support Towards
"Write Once, Run Anywhere/Anysize"

[Figure: software layering in 1990, 2000, and 2005. Labels: DoD
software development; COTS development; Vendor SW / Vendor Software;
Embedded Standards]

* Application software has traditionally been tied to the hardware

* Many acquisition programs are developing stove-piped middleware
"standards"

* Open software standards can provide portability, performance, and
productivity benefits

* Support "Write Once, Run Anywhere/Anysize"

Quantitative Goals & Impact

Program Goals

* Develop and integrate software technologies for embedded parallel
systems to address portability, productivity, and performance

* Engage acquisition community to promote technology insertion

* Deliver quantifiable benefits

Portability: reduction in lines-of-code to change port/scale to new
system

Productivity: reduction in overall lines-of-code

Performance: computation and communication benchmarks

[Figure label: Demonstrate]

HPEC-SI Capability Phases

[Figure: capability phases plotted over time; the garbled axis label
appears to read "Performance (1.5x)". Garbled functionality labels
appear to read VSIPL++, Parallel VSIPL++, and "prototype".]

• First demo successfully completed
• Second demo selected
• VSIPL++ v0.8 spec completed
• VSIPL++ v0.2 code available
• Parallel VSIPL++ v0.1 spec completed
• High performance C++ demonstrated

Phase 1

  Applied Research: Unified Comp/Comm Lib
  Development: Object-Oriented Standards
  Demonstration: Existing Standards

  Demonstrate insertions into fielded systems (CIP)
  * Demonstrate 3x portability

Phase 2

  Applied Research: Fault tolerance
  Development: Unified Comp/Comm Lib
  Demonstration: Object-Oriented Standards

  High-level code abstraction (AEGIS)
  * Reduce code size 3x

Phase 3

  Applied Research: Hybrid Architectures
  Development: Fault tolerance
  Demonstration: Unified Comp/Comm Lib

  Unified embedded computation/communication standard
  * Demonstrate scalability

• Introduction

• Demonstration
  - Common Imagery Processor
  - AEGIS BMD (planned)

• Development

• Applied Research

• Future Challenges

• Summary

Common Imagery Processor
- Demonstration Overview -

[Figure: CIP hardware; dimension label 38.5"]

Common Imagery Processor (CIP) is a cross-service component

Sample list of CIP modes

• U-2 (ASARS-2, SYERS)
• F/A-18 ATARS (EO/IR/APG-73)
• LO HAE UAV (EO, SAR)

[Figure label: System Manager]

Common Imagery Processor
- Demonstration Overview -

• Demonstrate standards-based platform-independent CIP processing
(ASARS-2)

• Assess performance of current COTS portability standards (MPI, VSIPL)

• Validate SW development productivity of emerging Data Reorganization
Interface

• MITRE and Northrop Grumman

[Figure: Common Imagery Processor targeting Shared-Memory Servers,
Embedded Multicomputers, Commodity Clusters, and Massively Parallel
Processors; additional label: SAR IF]

Single code base optimized for all high performance architectures
provides future flexibility


MITRE 


MIT Lincoln Laboratory 


AFRL 


Slide-12 

www.hpec-si.org

Software Ports

Embedded Multicomputers

• CSPI - 500MHz PPC7410 (vendor loan)
• Mercury - 500MHz PPC7410 (vendor loan)
• Sky - 333MHz PPC7400 (vendor loan)
• Sky - 500MHz PPC7410 (vendor loan)

Mainstream Servers

• HP/COMPAQ ES40LP - 833MHz Alpha ev6 (CIP hardware)
• HP/COMPAQ ES40 - 500MHz Alpha ev6 (CIP hardware)
• SGI Origin 2000 - 250MHz R10k (CIP hardware)
• SGI Origin 3800 - 400MHz R12k (ARL MSRC)
• IBM 1.3GHz Power 4 (ARL MSRC)
• Generic LINUX Cluster

Portability: SLOC Comparison

Shared Memory / CIP Server versus
Distributed Memory / Embedded Vendor

Application can now exploit many more processors, embedded processors
(3x form factor advantage) and Linux clusters (3x cost advantage)

Form Factor Improvements

Current Configuration

• IOP: 6U VME chassis (9 slots potentially available)
• IFP: HP/COMPAQ ES40LP

Possible Configuration

• IOP could support 2 G4 IFPs
• form factor reduction (x2)
• 6U VME can support 5 G4 IFPs
• processing capability increase (x2.5)

HPEC-SI Goals
1st Demo Achievements

Portability: zero code changes required
Productivity: DRI code 6x smaller vs MPI (est*)
Performance: 2x reduced cost or form factor

[Figure: goals diagram for Portability, Productivity, and Performance,
annotated "Achieved" against "Goal 3x" for Portability; label:
Demonstrate]

Portability: reduction in lines-of-code to change port/scale to new
system

Productivity: reduction in overall lines-of-code

Performance: computation and communication benchmarks

Object Oriented (VSIPL++)
Parallel (||VSIPL++)

Emergence of Component Standards

Definitions

VSIPL = Vector, Signal, and Image Processing Library
||VSIPL++ = Parallel Object Oriented VSIPL
MPI = Message-passing interface
MPI/RT = MPI real-time
DRI = Data Re-org Interface
CORBA = Common Object Request Broker Architecture
HP-CORBA = High Performance CORBA

VSIPL++ Productivity Examples

BLAS zherk Routine

* BLAS = Basic Linear Algebra Subprograms

* Hermitian matrix M: conjug(M) = M^t

* zherk performs a rank-k update of Hermitian matrix C:

    C <- α * A * conjug(A)^t + β * C

• VSIPL code

    A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)^t */
    vsip_rscmmul_d(alpha,tmp,tmp);  /* α*A*conjug(A)^t */
    vsip_rscmmul_d(beta,C,C);       /* β*C */
    vsip_cmadd_d(tmp,C,C);          /* α*A*conjug(A)^t + β*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));

• VSIPL++ code

    Matrix<complex<double> > A(10,15);
    Matrix<complex<double> > C(10,10);
    C = alpha * prodh(A,A) + beta * C;

Sonar Example

• K-W Beamformer

• Converted C VSIPL to VSIPL++

• 2.5x fewer SLOCs
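
For readers without a VSIPL or VSIPL++ implementation at hand, the rank-k update on this slide can be written out as explicit loops in standard C++. This is only a reference sketch of the arithmetic: the `Mat` struct and the `herk` function are hypothetical helpers invented for illustration, not part of either library, and no attempt is made at performance.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Hypothetical minimal row-major matrix; not the VSIPL++ Matrix type.
struct Mat {
    std::size_t rows, cols;
    std::vector<cd> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    cd& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const cd& operator()(std::size_t i, std::size_t j) const {
        return data[i * cols + j];
    }
};

// C <- alpha * A * conjug(A)^t + beta * C, with real alpha and beta.
void herk(double alpha, const Mat& A, double beta, Mat& C) {
    assert(C.rows == A.rows && C.cols == A.rows);
    for (std::size_t i = 0; i < A.rows; ++i) {
        for (std::size_t j = 0; j < A.rows; ++j) {
            cd acc(0.0, 0.0);
            // Entry (i, j) of A * conjug(A)^t is a dot product of rows i and j.
            for (std::size_t k = 0; k < A.cols; ++k) {
                acc += A(i, k) * std::conj(A(j, k));
            }
            C(i, j) = alpha * acc + beta * C(i, j);
        }
    }
}
```

With the slide's dimensions, `A` would be 10x15 and `C` 10x10. The triple loop makes visible what the single VSIPL++ expression hides, which is the productivity point the slide is making.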

PVL PowerPC AltiVec Experiments

Results

• Hand coded loop achieves good performance, but is problem specific
and low level

• Optimized VSIPL performs well for simple expressions, worse for more
complex expressions

• PETE style array operators perform almost as well as the hand-coded
loop and are general, can be composed, and are high-level

[Figure: comparison of AltiVec loop, VSIPL, and PETE/AltiVec for the
expressions A=B+C, A=B+C*D, A=B+C*D+E*F, and A=B+C*D+E/F]

Software Technology

AltiVec loop

* C
* For loop
* Direct use of AltiVec extensions
* Assumes unit stride
* Assumes vector alignment

VSIPL (vendor optimized)

* C
* AltiVec aware VSIPro Core Lite (www.mpi-softtech.com)
* No multiply-add
* Cannot assume unit stride
* Cannot assume vector alignment

PETE with AltiVec

* C++
* PETE operators
* Indirect use of AltiVec extensions
* Assumes unit stride
* Assumes vector alignment
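
The expression-template technique behind PETE-style operators can be sketched in a few lines of standard C++. The code below is not PETE and uses no AltiVec extensions: every type here (`Expr`, `AddExpr`, `MulExpr`, `Vec`) is a hypothetical stand-in. It only shows the mechanism, in which `b + c * d` builds a lightweight expression object and the assignment evaluates it in one fused loop with no temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base so the operators below match only expression types.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for l + r; computes one element on demand.
template <class L, class R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& l;
    const R& r;
    AddExpr(const L& a, const R& b) : l(a), r(b) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Node for l * r (element-wise).
template <class L, class R>
struct MulExpr : Expr<MulExpr<L, R>> {
    const L& l;
    const R& r;
    MulExpr(const L& a, const R& b) : l(a), r(b) {}
    double operator[](std::size_t i) const { return l[i] * r[i]; }
    std::size_t size() const { return l.size(); }
};

template <class L, class R>
AddExpr<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
    return AddExpr<L, R>(a.self(), b.self());
}

template <class L, class R>
MulExpr<L, R> operator*(const Expr<L>& a, const Expr<R>& b) {
    return MulExpr<L, R>(a.self(), b.self());
}

struct Vec : Expr<Vec> {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
    double operator[](std::size_t i) const { return d[i]; }
    double& operator[](std::size_t i) { return d[i]; }
    std::size_t size() const { return d.size(); }

    // Assigning an expression runs a single loop over all elements.
    template <class E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == d.size());
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = expr[i];
        return *this;
    }
};
```

Writing `a = b + c * d;` with four `Vec` objects then compiles down to one loop computing `b[i] + c[i] * d[i]`. A library in the style the slide describes would presumably swap the scalar loop body for vector instructions; that step is outside this sketch.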




Parallel Pipeline Mapping

[Figure: Signal Processing Algorithm mapped to a parallel pipeline;
first stage label: Filter, X_out = FIR(X_in)]

• Data Parallel within stages
• Task/Pipeline Parallel across stages

Scalable Approach

    #include <Vector.h>
    #include <AddPvl.h>

    void addVectors(aMap, bMap, cMap) {
        Vector< Complex<Float> > a('a', aMap, LENGTH);
        Vector< Complex<Float> > b('b', bMap, LENGTH);
        Vector< Complex<Float> > c('c', cMap, LENGTH);

        b = 1;
        c = 2;
        a = b + c;
    }

Single Processor Mapping

    A = B + C

Multi Processor Mapping

    A = B + C

Lincoln Parallel Vector Library (PVL)

• Single processor and multi-processor code are the same

• Maps can be changed without changing software

• High level code is compact
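
The role a map plays can be sketched without PVL itself. In the standard C++ fragment below, `BlockMap` and `add_vectors` are hypothetical names, and the "processors" are simulated by a serial loop over ranks rather than real parallel execution. The point it illustrates is the one on the slide: the algorithm body is unchanged whether the map describes one processor or several.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical block map: splits a vector across `nprocs` ranks in
// contiguous chunks whose sizes differ by at most one element.
struct BlockMap {
    std::size_t nprocs;

    // Half-open index range [begin, end) owned by `rank`.
    std::pair<std::size_t, std::size_t>
    local_range(std::size_t rank, std::size_t length) const {
        assert(nprocs > 0 && rank < nprocs);
        const std::size_t base = length / nprocs;
        const std::size_t extra = length % nprocs;
        const std::size_t begin = rank * base + (rank < extra ? rank : extra);
        const std::size_t end = begin + base + (rank < extra ? 1 : 0);
        return {begin, end};
    }
};

// a = b + c. Each simulated rank touches only the indices its map assigns.
void add_vectors(const BlockMap& map, std::vector<double>& a,
                 const std::vector<double>& b, const std::vector<double>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    // This outer loop stands in for ranks running concurrently.
    for (std::size_t rank = 0; rank < map.nprocs; ++rank) {
        const auto range = map.local_range(rank, a.size());
        for (std::size_t i = range.first; i < range.second; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}
```

Calling `add_vectors(BlockMap{1}, ...)` and `add_vectors(BlockMap{4}, ...)` gives the same result; only the map argument differs. Moving work off a failed processor would likewise amount to passing a different map.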

• Fault Tolerance

• Parallel Specification

• Hybrid Architectures (see SBR)

Dynamic Mapping for Fault Tolerance

• Switching processors is accomplished by switching maps

• No change to algorithm required

• Developing requirements for ||VSIPL++

Parallel Specification

Clutter Calculation (Linux Cluster)

    % Initialize
    pMATLAB_Init; Ncpus=comm_vars.comm_size;

    % Map X to first half and Y to second half.
    mapX=map([1 Ncpus/2],{},[1:Ncpus/2])
    mapY=map([Ncpus/2 1],{},[Ncpus/2+1:Ncpus]);

    % Create arrays.
    X = complex(rand(N,M,mapX),rand(N,M,mapX));
    Y = complex(zeros(N,M,mapY));

    % Initialize coefficients
    coefs = ...
    weights = ...

    % Parallel filter + corner turn.
    Y(:,:) = conv2(coefs,X);

    % Parallel matrix multiply.
    Y(:,:) = weights*Y;

    % Finalize pMATLAB and exit.
    pMATLAB_Finalize; exit;

• Matlab is the main specification language for signal processing

• pMatlab allows parallel specifications using same mapping
constructs being developed for ||VSIPL++

Optimal Mapping of Complex Algorithms

[Figure labels: Application; Workstation; Embedded (two instances); Hardware]

* Need to automate process of mapping algorithm to hardware

HPEC-SI Future Challenges

[Figure: capability phases plotted over time, with "End of 5 Year Plan"
marked; the garbled axis label appears to read "Performance". Garbled
labels include "prototype" and "Hybrid VSIPL".]

Phase 3

  Applied Research: Hybrid Architectures
  Development: Fault tolerance
  Demonstration: Unified Comp/Comm Lib

  Unified Comp/Comm Standard
  * Demonstrate scalability

Phase 4

  Applied Research: PCA/Self-optimization
  Development: Hybrid Architectures
  Demonstration: Fault tolerance

  Demonstrate Fault Tolerance
  * Increased reliability

Phase 5

  Applied Research: Higher Languages (Java?)
  Development: Self-optimization
  Demonstration: Hybrid Architectures

  Portability across architectures
  * RISC/FPGA Transparency

Summary

• HPEC-SI Program on track toward changing software practice
in DoD HPEC Signal and Image Processing

  - Outside funding obtained for DoD program specific activities
    (on top of core HPEC-SI effort)

  - 1st Demo completed; 2nd selected

  - World's first parallel, object-oriented standard

  - Applied research into task/pipeline parallelism; fault tolerance;
    parallel specification

• Keys to success

  - Program Office Support: 5 Year Time horizon better match to
    DoD program development

  - Quantitative goals for portability, productivity and performance

  - Engineering community support

Web Links

High Performance Embedded Computing Workshop
http://www.ll.mit.edu/HPEC

High Performance Embedded Computing Software Initiative
http://www.hpec-si.org/

Vector, Signal, and Image Processing Library
http://www.vsipl.org/

MPI Software Technologies, Inc.
http://www.mpi-softtech.com/

Data Reorganization Initiative
http://www.data-re.org/

CodeSourcery, LLC
http://www.codesourcery.com/

MatlabMPI
http://www.ll.mit.edu/MatlabMPI